We share this use case to show all features of rANOMALY packages that allow to highlight microbial community differences between groups of samples. rANOMALY main features are :

  • ASV generation thanks to dada2 algorithm

  • Taxonomic assignment thanks to IDTAXA from DECIPHER packages. rANOMALY allow to assign with up to 2 complementary reference database (eg. SILVA which is generalist, DAIRYdb which is specific to dairy environments), keep the best and more confident taxonomic assignment, check for taxonomy incongruency like empty fields and multiple ancestors.

  • phyloseq format to handle datas and downstream statistical analysis.

  • Decontamination step thanks to decontam package. rANOMALY allows to filter ASV (Amplicon Sequence Variant) dependings on ASV overall frequence (low abondance filter), and ASV prevalence (presence in 1 or more samples), specific filters like manual suppression of specific taxa are available to.

  • Complete statistical analysis workflow: rarefaction curves, community composition plots, alpha/beta diversity analysis with associated statistical tests and multivariate analysis, and differential analysis using 3 three different methods (metacoder, DESeq2, metagenomeSeq).

  • At each steps, plots and R objects are saved in folders to be able to relaunch the analysis and to export plots.

1 Help

Each function has a detailed help accessible in R via ?{function}.

2 Tests datasets

The dataset can be downloaded via this link.

This tutorial assume that you have extracted all the read file in a folder named reads along with the sample-metadata.csv file.

We share a 24 samples test dataset extracted from rats feces at two different time (t0 & t50) and in two nutrition conditions. Rats were feeded with two strains of a particular bacteria (wild type & mutant). Also, extraction control sample (blank) are included too.

sm <- read.table("sample_metadata.csv", sep="\t",header=TRUE)
DT::datatable(sm)
load("decontam_out/robjects.Rdata")

3 Processing of raw sequences

3.1 ASV definition with DADA2

The first step is the creation of ASVs thanks to the dada2 package. In rANOMALY, only one function is needed to compute all the different steps require from this package. The function is designed to process illumina paired end sequences in fastq format.

Sample names are extracted from the file name, from the beginning to the first underscore (_), so files must be formatted as followed : {sample-id1}_R1.fastq.gz {sample-id1}_R2.fastq.gz etc… Those must match the sample names stored in the sample metadata file.

dada_res = dada2_fun(path="./reads", dadapool = "pseudo", compress=TRUE, plot=FALSE)

Main outputs:

  • ./dada2_out/read_tracking.csv summarizes the read numbers after each filtering step.

  • ./dada2_out/raw_otu-table.csv the raw ASV table

  • ./dada2_out/rep-seqs.fna: fasta file with all representative sequences for each ASV

  • ./dada2_out/robjects.Rdata with saved dada_res list containing raw ASV table and representative sequences in objects otu.table, seqtab.export & seqtab.nochim.

Read tracking table:

  • input: raw read number.

  • filtered: after dada2 filtering step: no N’s in sequence, low quality, and phiX.

  • denoisedF & denoisedR: after denoising. Forward & Reverse.

  • merged: after merging R1 & R2.

  • nonchim: after chimeras filtering.

DT::datatable(read.table("dada2_out/read_tracking.csv",sep="\t",header=TRUE), filter = "top")

3.2 Taxonomic assignment

assign_taxo_fun function uses IDTAXA function from DECIPHER package, and allows to use 2 different databases. It keeps the best assignment on 2 criteria, resolution (depth in taxonomy assignment) and confidence. The final taxonomy is validated by multiple ancestors taxa and incongruity correction step.

We share the latest databases we use in the IDTAXA format in this link. You can also generate your own IDTAXA formatted database following those instructions and scripts we provide at this page.

tax.table = assign_taxo_fun(dada_res = dada_res, id_db = c("path_to_your_banks/silva/SILVA_SSU_r132_March2018.RData","path_to_your_banks/DAIRYdb_v1.2.0_20190222_IDTAXA.RData") )

Main file outputs:

  • ./idtaxa/robjects.Rdata with taxonomy in phyloseq format in tax.table object.

  • final_tax_table.csv the final assignation table that will be use in next steps.

  • allDB_tax_table.csv raw assignations from the two databases, mainly for debugging.

3.3 Phylogenetic Tree

The phylogenetic tree from the representative sequences is generated using phangorn and DECIPHER packages.

tree = generate_tree_fun(dada_res)

Main outputs: - tree_robjects.Rdata with phylogenetic tree object in phyloseq format.

3.4 Phyloseq object

To create a phyloseq object, we need to merge four objects and one file:

  • the asv table dada_res and the representative sequences seqtab.nochim from dada2_robjects.Rdata

  • a taxonomy table (taxtable) taxo_robjects.Rdata from taxo_robjects.Rdata

  • the phylogenetic tree tree from tree_robjects.Rdata

  • metadata from sample-metadata.csv

data = generate_phyloseq_fun(dada_res = dada_res, tax.table = tax.table, tree = tree, metadata = "./sample_metadata.csv")
data
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 146 taxa and 24 samples ]
## sample_data() Sample Data:       [ 24 samples by 5 sample variables ]
## tax_table()   Taxonomy Table:    [ 146 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 146 tips and 144 internal nodes ]
## refseq()      DNAStringSet:      [ 146 reference sequences ]

Main output:

  • ./phyloseq/robjects.Rdata with phyloseq object in data for raw counts and data_rel for relative abundance.

3.5 Decontamination

The decontam_fun function uses decontam R package with control samples to filter contaminants. The decontam package offers two main methods, frequency and prevalence (and then you can combine those methods). For frequency method, it is mandatory to have the DNA concentration of each sample in phyloseq (and hence in the sample-metadata.csv). The prevalence method does not need DNA quantification, this method allows to compare presence/absence of ASV between real samples and control samples and then identify contaminants.

Tips: sequencing plateforms often quantify the DNA before sequencing, but do not automaticaly give the information. Just ask for it ;).

Our function integrates the basics ASV frequency (nb_reads_ASV/nb_total_reads) and overall prevalence (nb_sample_ASV/nb_total_sample) filtering. We included an option to filter out ASV based on their taxa names (for known recurrent contaminants).

data = decontam_fun(data = data, domain = "Bacteria", column = "type", ctrl_identifier = "control", spl_identifier = "sample", number = 100, krona= TRUE)

Main outputs: - robjects.Rdata with contaminant filtered phyloseq object named data. - Exclu_out.csv list of filtered ASVs for each filtering step. - Kronas plot before and after filtering. - raw_asv-table.csv & relative_asv-table.csv. - venndiag_filtering.png.

4 Plots, diversity and statistics

We are currently developping a ShinyApp to visualize your data, sub-select your samples/taxons and do all those analyses interactively !!! ExploreMetabar

4.1 Rarefaction curves

In order to observe the sampling depth of each samples we start by plotting rarefactions curves. Those plots are generated by plotly which makes the plots interactive.

rareplot = rarefaction(data, "strain_time", 100 )
htmltools::tagList(list(rareplot))

4.2 Composition plots

The bar_fun function outputs a composition plot, in this example it reveals the top 10 genus present in our samples. The function allows to plot at different rank and to modify the number of taxas to show.

  • Ord1 option order the sample along the X axis.

  • Fact1 option control labels of the X axis. Fact1="sample.id" if you don’t want the sample to be renamed.

4.2.1 Raw abundance

bars_fun(data = data, top = 10, Ord1 = "strain_time", Fact1 = "strain_time", rank="Genus", relative = FALSE)

4.2.2 Relative abundance

bars_fun(data = data, top = 10, Ord1 = "strain_time", Fact1 = "strain_time", rank="Genus", relative = TRUE)

4.3 Diversity analyses

4.3.1 Alpha diversity

This function computes various alpha diversity indexes, it uses the estimate_richness function from phyloseq.

Available measures : c(“Observed”, “Chao1”, “ACE”, “Shannon”, “Simpson”, “InvSimpson”, “Fisher”).

It returns a list which contains:

  • a boxplot comparing conditions. ($plot)

  • a table of indices values. ($alphatable)

And for each of the computed indices :

  • an ANOVA analysis. (${measure}$anova)

  • a pairwise wilcox test result comparing conditions and giving the pvalue of each comparison tested. (${measure}$wilcox_col1, ${measure}$wilcox_col2_fdr, ${measure}$wilcox_col2_collapsed)

  • a mixture model if your data include repetition in sampling, ie. column3 option. (${measure}$anovarepeat, ${measure}$mixedeffect)

All this in a single function.

alpha <- diversity_alpha_fun(data = data, output = "./plot_div_alpha/", column1 = "strain", column2 = "time", column3 = "", supcovs = "", measures = c("Observed", "Shannon") )

4.3.2 Table of diversity index values

The table of values for each indices you choose to compute.

pander(alpha$alphatable, style='rmarkdown')
  Observed Shannon
SB1.Sauv0 41 1.477
SB10.Mut0 40 2.073
SB11.Mut0 51 2.178
SB12.Mut0 38 2.116
SB13.Sauv50 46 2.691
SB14.Sauv50 57 2.905
SB15.Sauv50 50 2.793
SB16.Sauv50 52 2.8
SB17.Sauv50 49 2.624
SB18.Sauv50 54 2.831
SB19.Mut50 66 2.638
SB2.Sauv0 26 2.099
SB20.Mut50 72 2.721
SB21.Mut50 79 3.062
SB22.Mut50 81 2.81
SB23.Mut50 84 3.175
SB24.Mut50 90 3.148
SB3.Sauv0 19 0.1962
SB4.Sauv0 41 2.52
SB5.Sauv0 46 1.923
SB6.Sauv0 46 1.067
SB7.Mut0 33 2.256
SB8.Mut0 58 2.089
SB9.Mut0 50 2.237

4.3.3 Boxplots

The boxplots of those values.

ggplotly(alpha$plot) %>%
  layout(boxmode = "group")
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## 'font', 'title', 'uniformtext', 'autosize', 'width', 'height', 'margin', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'showlegend', 'colorway', 'datarevision', 'uirevision', 'editrevision', 'selectionrevision', 'template', 'modebar', 'meta', 'transition', '_deprecated', 'clickmode', 'dragmode', 'hovermode', 'hoverdistance', 'spikedistance', 'hoverlabel', 'selectdirection', 'grid', 'calendar', 'xaxis', 'yaxis', 'ternary', 'scene', 'geo', 'mapbox', 'polar', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'editType', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'sliders', 'colorscale', 'coloraxis', 'metasrc', 'barmode', 'bargap', 'mapType'

4.3.4 Tests on Observed index

4.3.4.1 ANOVA results

For each indices, you have access to the ANOVA test. Here we present the result for the “Observed” indice.

pander(alpha$Observed$anova)
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
Depth 1 49.36 49.36 0.5091 0.4838
strain 1 1877 1877 19.36 0.0002764
time 1 3649 3649 37.64 5.392e-06
Residuals 20 1939 96.96 NA NA

4.3.4.2 Wilcox test

Wilcox tests are made on each factor you have entered, and the combination of the two. Here “strain” and “time”.

4.3.4.3 Wilcox test for “strain” factor

pander(alpha$Observed$wilcox_col1)
  mutant
wildtype 0.043

4.3.4.4 Wilcox test for “time” factor

pander(alpha$Observed$wilcox_col2_fdr)
  t0
t50 0.001

4.3.4.5 Wilcox test for the collapsed factors

pander(alpha$Observed$wilcox_col2_collapsed)
  mutant_t0 mutant_t50 wildtype_t0
mutant_t50 0.002 NA NA
wildtype_t0 0.377 0.005 NA
wildtype_t50 0.336 0.002 0.008

4.4 Beta diversity

The diversity_beta_fun function is an exhaustive function allowing to generate all possible tests and ordination in one command.

The diversity_beta_light function is equivalent to the first fonction but allows to generate specific tests and figures ready to publish in rmarkdown as in the example below. It is based on the vegan package function vegdist for the distance calculation and ordinate for the ordination plot.

We include statiscal test to ease the interpretation of your results. Here we include permutational ANOVA to compare groups and test that the centroids and dispersion of the groups are equivalent for all groups. User can inform col and cov arguments to assess PERMANOVA to determine significant differences between groups (eg. factor “strain” here and covariable “time”). A pairwise-PERMANOVA is processed to determine which conditions are significantly different from another (based on p-value).

As return, you will get an object that contains:

  • An ordination plot ($plot)

  • The permANOVA results ($permanova)

  • The pairwise permANOVA ($pairwisepermanova)

# beta <- diversity_beta_light(data = data, output = "./plot_div_beta/", glom = "ASV", column1 = "time", column2 = "strain", covar ="")
beta_strain = diversity_beta_light(data, col = "strain", cov="time", dist0 = "bray", ord0 = "MDS", output="./plot_div_beta_strain/", tests = TRUE)
beta_time = diversity_beta_light(data, col = "time", cov="strain", dist0 = "bray", ord0 = "MDS", output="./plot_div_beta_time/", tests = TRUE)

beta_strain_time = diversity_beta_light(data, col = "strain_time", dist0 = "bray", ord0 = "MDS", output="./plot_div_beta_strain_time/", tests = TRUE)

4.4.1 Ordination plot

htmltools::tagList(list(ggplotly(beta_strain$plot), ggplotly(beta_time$plot), ggplotly(beta_strain_time$plot)))

4.4.2 Statistical tests

  • permanova
pander(beta_strain$permanova)
Permutation: free
  Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
Depth 1 0.5075 0.5075 3.122 0.06845 0.01199
time 1 2.185 2.185 13.44 0.2946 0.000999
strain 1 1.471 1.471 9.049 0.1984 0.000999
Residuals 20 3.251 0.1626 NA 0.4385 NA
Total 23 7.415 NA NA 1 NA
  • pairwise permanova on concatenated factors
pander(beta_strain_time$pairwisepermanova)
Table continues below
pairs Df SumsOfSqs F.Model R2 p.value
wildtype_t0 vs mutant_t0 1 0.9528 4.64 0.317 0.008
wildtype_t0 vs wildtype_t50 1 2.021 28.97 0.7434 0.003
wildtype_t0 vs mutant_t50 1 2.197 26 0.7223 0.003
mutant_t0 vs wildtype_t50 1 1.681 11.59 0.5369 0.003
mutant_t0 vs mutant_t50 1 1.57 9.826 0.4956 0.003
wildtype_t50 vs mutant_t50 1 1.818 75.22 0.8827 0.004
p.adjusted sig
0.008 *
0.0045 *
0.0045 *
0.0045 *
0.0045 *
0.0048 *

4.5 Differential analyses

We choose three different methods to process differential analysis which is a key step of the workflow. The main advantage of the use of multiple methods is the cross validation of differentially abundant taxa between tested conditions.

4.5.1 Metacoder

Metacoder is the most simple differential analysis tool of the three. Counts are normalized by total sum scaling to minimise the sample sequencing depth effect and metacoder use a Kruskal-Wallis test to determine significant difference between sample groups. The metacoder_fun function allows the user to choose the taxonomic rank, which factor to the test (column1), and a specific pairwise comparison (comp) to launch the differential analysis.

It also produces pretty graphical trees, representing the taxa present in both groups and coloring the branches depending on the group in which this taxa is more abundant. Two of those trees are produced, the raw one and the non significant features filtered one (p-value <= 0.05).

out1 = metacoder_fun(data = data, output = "./metacoder", column1 = "strain_time", rank = "Family", signif = TRUE, plottrees = TRUE, min ="10", comp = "wildtype_t50~mutant_t50,wildtype_t0~mutant_t0")
  • Table
 DT::datatable(out1$table, filter = "top", options = list(scrollX = TRUE))
  • Comparaison 1
 out1$wildtype_t0_vs_mutant_t0$signif
## [[1]]

  • Comparaison 2
 out1$wildtype_t50_vs_mutant_t50$signif
## [[1]]

4.5.2 DESeq2

DESeq2 is a widely used method, primarily for RNAseq applications, for assessing differentially expressed genes between controlled conditions. Its use for metabarcoding datas is sensibly the same. The deseq2_fun allows to process differential analysis as metacoder_fun, and users can choose the taxonomic rank, factor to test and which conditions to compare. DESeq2 algorithm uses negative binomial generalized linear models with VST normalization (Variance Stabilizing Transformation).

Main output is a list with :

  • list\({comparison}\)plot: plot with significant features
  • list\({comparison}\)table: table with statistics (LogFoldChange, pvalue, adjusted pvalue…)
#fig.keep='all', fig.align='left', fig.width = 15, fig.height = 10
out2 = deseq2_fun(data = data, output = "./deseq/", column1 = "strain_time", verbose = 1, rank = "Family", comp = "wildtype_t50~mutant_t50,wildtype_t0~mutant_t0")
ggplotly(out2$wildtype_t50_vs_mutant_t50$plot)
DT::datatable(out2$wildtype_t50_vs_mutant_t50$table, filter = "top", options = list(scrollX = TRUE))

4.5.3 MetagenomeSeq

MetagenomeSeq use a normalization method able to control for biases in measurements across taxonomics features and a mixture model taht implements a zero-inflated Gaussian distribution to account for varying depths of coverage. As deseq2_fun, metagenomeseq_fun returns table with statistics and plots with significant features for each comparison.

out3 = metagenomeseq_fun(data = data, output = "./metagenomeseq/", column1 = "strain_time", verbose = 1, rank = "Family", comp = "wildtype_t50~mutant_t50,wildtype_t0~mutant_t0")
ggplotly(out3$wildtype_t50_vs_mutant_t50$plot)
DT::datatable(out3$wildtype_t50_vs_mutant_t50$table, filter = "top", options = list(scrollX = TRUE))

4.5.4 Aggregate methods

The aggregate_fun function allows to merge the results from the three differential analysis methods used before, to obtain one unique table with all information of significant differentially abundant features.

The generated table include the following informations :

  • seqid: ASV ID

  • Comparaison: Tested comparison

  • Deseq: differentially abundant with this method (0 no or 1 yes)

  • metagenomeSeq: differentially abundant with this method (0 no or 1 yes)

  • metacoder:differentialy abundant with this method (0 no or 1 yes)

  • sumMethods: sum of methods in which feature is significant.

  • DESeqLFC: Log Fold Change value from DESeq2

  • absDESeqLFC: absolute value of Log Fold Change value from DESeq2

  • MeanRelAbcond1: Mean relative abundance in condition 1

  • MeanRelAbcond2: Mean relative abundance in condition 2

  • Condition: in which the mean feature abundance is higher.

  • Taxonomy & representative sequence.

resF = aggregate_fun(data = data, metacoder = "./metacoder/metacoder_signif_Family.csv", deseq = "./deseq/", mgseq = "./metagenomeseq/", output = "./aggregate_diff/", column1 = "strain_time", column2 = NULL, verbose = 1, rank = "Genus", comp = "wildtype_t50~mutant_t50,wildtype_t0~mutant_t0")
ggplotly(resF$wildtype_t0_vs_mutant_t0$plot)
ggplotly(resF$wildtype_t50_vs_mutant_t50$plot)
DT::datatable(resF$table, filter = "top", options = list(scrollX = TRUE))

4.5.5 PLSDA (Partial Least Squares Discriminant Analysis)

Our PLS-DA/sparse PLS-DA analysis is based on the mixOmics package. These are supervised classification methods. They allow to define most important features discriminating our groups. The plsda_res function outputs a list of graphs and tables:

  • plsda$plotIndiv ordination plot from the PLSDA analysis.

  • splsda$plotIndiv ordination plot from the sparse PLSDA analysis.

  • loadings$comp{n} all the loadings plots for each composant from the sPLSDA analysis. Loadings weights are displayed and colors correspond to the group in which each feature has its maximum mean value.

  • splda.loading_table Loading of all taxas for each composant.

  • splsda$plotArrow arrow plot of samples.

plsda_res <- plsda_fun(data = data, output = "./plsda_family/", column1 = "strain_time", rank = "Family")
plsda_res$splsda.plotIndiv$graph

knitr::include_graphics('plsda_family/splsda_loadings_strain_time_Family_comp1.png')

knitr::include_graphics('plsda_family/splsda_loadings_strain_time_Family_comp2.png')

knitr::include_graphics('plsda_family/splsda_loadings_strain_time_Family_comp3.png')

knitr::include_graphics('plsda_family/splsda_loadings_strain_time_Family_comp4.png')

5 Miscellaneous function

5.1 Heatmap

heatmap_fun allows to plot relative abundance of more abundant taxas (20 by default). Use can choose which taxonomy rank to plot and the factor used to seprate plot.

heatmap_plot = heatmap_fun(data = data, column1 = "strain_time", top = 20, output = "./plot_heatmap/", rank = "Species")
ggplotly(heatmap_plot)

5.2 Shared taxa

  • Venn Diagram

ASVenn_fun allows to define shared taxa between specific conditions (up to 5 conditions). It generates Venn Diagram and table informing presence of each ASV in each condition with taxonomy and sequence. Counts can be verified with this table. rank argument allows to generate output with upper taxonomy level taxas.

outvenn = ASVenn_fun(data = data,output = "./ASVenn/", rank = "ASV", column1 = "strain_time", shared = TRUE)
  • Venn diagram:
grid.draw(outvenn$venn_plot)

  • Table:
DT::datatable(outvenn$TABf, filter = "top", options = list(scrollX = TRUE))
  • Cytoscape phy2cyto_fun allows to generate input files for cytoscape which is usefull to vizualise shared taxa. You can see below an example of representation with Cytoscape on our data. Each nodes and arrows position and aesthetic can be easily modified.

5.3 Supplementary tools and features

csv2phyloseq_fun allows to import the 4 files in tabulated format to generate phyloseq object ready to use with rANOMALY.

assign_fasta_fun is used to assign sequences from fasta files.

export_to_stamp_fun allows you to generate input files for STAMP.

split_table_fun allows to split the phyloseq object in multiple sub-object according to one column.

update_metadata_fun allows you to easily modify sample_data part of the phyloseq object.